Abstract Higher-level cognition includes logical reasoning
and the ability to answer questions with common sense.
The RatioLog project addresses the problem of rational
reasoning in deep question answering by methods from
automated deduction and cognitive computing. In a first phase,
we combine techniques from information retrieval and
machine learning to find appropriate answer candidates in
the huge amount of text in the German version of the free
encyclopedia “Wikipedia”. In a second phase, an automated
theorem prover tries to verify the answer candidates on
the basis of their logical representations. In a third phase,
because the knowledge may be incomplete and inconsistent,
we consider extensions of logical reasoning to improve
the results. In this context, we work toward the application
of techniques from human reasoning: We employ
defeasible reasoning to compare the answers w.r.t. specificity,
deontic logic, normative reasoning, and model construction.
Moreover, we use integrated case-based reasoning and
machine learning techniques on the basis of the semantic
structure of the questions and answer candidates to learn to
give the right answers.

1 Rational Reasoning and Question Answering

The development of formal logic played a big role in the
field of automated reasoning, which led to the development
of the field of artificial intelligence (AI). Applications of
automated deduction in mathematics have been investigated
from the early years on. Nowadays automated deduction
techniques are successfully applied in hardware and software
verification and many other areas (for an overview see 0).

In contrast to formal logical reasoning, however, human
reasoning does not strictly follow the rules of classical logic.
Reasons may be incomplete knowledge, incorrect beliefs,
and inconsistent norms. From the very beginning of AI
research, there has been a strong emphasis on incorporating
mechanisms for rationality, such as abductive or defeasible
reasoning. From these efforts, as part of the field of
knowledge representation, common-sense reasoning has emerged
as a branching discipline with many applications in AI (m.

Nowadays there is a chance to join automated deduction
and common-sense reasoning within the paradigm of
cognitive computing, which allows the implementation of
rational reasoning o. The general motivation for the
development of cognitive systems is that computers can solve
well-defined mathematical problems with enormous precision
and at a speed that is sufficient in practice. It remains
difficult, however, to solve problems that are only vaguely
outlined. One important characteristic of cognitive computing
is that many different knowledge formats and many different
information processing methods are used in a combined fashion.
Also, the amount of knowledge is huge and, even worse, it
is increasing steadily. For the logical reasoning, a similar
argument holds: Different reasoning mechanisms have
to be employed and combined, such as classical deduction
(forward reasoning) on the one hand, and abduction or other
non-monotonic reasoning mechanisms on the other hand.

Figure 1 The LogAnswer system uses information retrieval (IR),
decision tree learning in a ranking phase, reasoning, and
natural language answer generation to compute answers.

Let us illustrate this with a well-known example from
the literature:


1. Tom is an emu. 

2. Emus are birds. 

3. Birds normally fly. 

4. Emus do not fly. 


The question is: Can emus fly or not? Forward reasoning
allows us to infer that emus are birds and hence can
normally fly. This is in conflict, however, with the strict
background knowledge that emus do not fly. The conflict can be
resolved by treating certain knowledge as default or
defeasible, i.e. as holding only normally. Hence we may conclude
here that emus, and therefore Tom, do not fly. We will come
back to this example later (namely in Sections 2.2 and 2.3).
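
The resolution of the emu conflict can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions, not the mechanism used in LogAnswer: knowledge is propositional, each rule has a single premise, and the names STRICT, DEFEASIBLE, negate, close, and conclusions are hypothetical. Strict rules are applied first, and a defeasible rule is used only if its conclusion does not contradict a strictly derived literal.

```python
# Strict rules hold without exception; defeasible rules hold only "normally".
# Each rule is a (premise, conclusion) pair over propositional literals.
STRICT = [("emu", "bird"), ("emu", "not flies")]
DEFEASIBLE = [("bird", "flies")]


def negate(literal):
    """Return the complementary literal ('flies' <-> 'not flies')."""
    return literal[4:] if literal.startswith("not ") else "not " + literal


def close(facts, rules, blocked=frozenset()):
    """Forward-chain over the rules until no new literal can be added.

    Literals in `blocked` are never derived.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if (premise in known and conclusion not in known
                    and conclusion not in blocked):
                known.add(conclusion)
                changed = True
    return known


def conclusions(facts):
    """Apply strict rules first, then defeasible rules that do not clash."""
    strict = close(facts, STRICT)
    # A defeasible conclusion is blocked if its negation is strictly known.
    blocked = frozenset(negate(lit) for lit in strict)
    return close(strict, STRICT + DEFEASIBLE, blocked)


tom = conclusions({"emu"})        # Tom is an emu
tweety = conclusions({"bird"})    # some bird that is not known to be an emu
```

For the emu Tom, the strict conclusion "not flies" blocks the default, whereas for a plain bird the default "flies" goes through. Conflicts between several defeasible rules are deliberately not handled in this sketch.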

Rational reasoning must be able to deal with incomplete
as well as conflicting (or even inconsistent) knowledge.
Moreover, huge knowledge bases with inconsistent contents
must be handled. Therefore, it seems to be a good idea to
combine and thus enhance rational reasoning by information
retrieval techniques, e.g. techniques from machine learning.
This holds especially for the domain of deep question
answering, where communication with patterns of human
reasoning is desirable.


1.1 Deep Question Answering and the LogAnswer System 

Typically, question answering systems, including
application programs such as Okay Google® or Apple®’s Siri,
communicate with the user in natural language. They
accept properly formulated questions and return concise
answers. These automatically generated answers are usually
not simply extracted directly from the web; in addition, the
system operates on an extensive (background) knowledge base,
which has been derived from textual sources in advance.

LogAnswer 09) is an open-domain question answering
system, accessible via a web interface (www.loganswer.de)
similar to that of a search engine. The knowledge used
to answer the question is gained from 29.1 million
natural-language sentences of a snapshot of the German Wikipedia.
Furthermore, background knowledge consisting of 12,000
logical facts and rules is used. The LogAnswer system was
developed in the DFG-funded LogAnswer project, a
cooperation between the groups on Intelligent Information and
Communication Systems at the FernUniversität Hagen and
the AI research group at the University of Koblenz-Landau.
The project aimed at the development of efficient and robust
methods for logic-based question answering. The user enters
a question and LogAnswer presents the five best answers
from a snapshot of the German “Wikipedia”, highlighted in
the context of the relevant textual sources.

Most question answering systems rely on shallow
linguistic methods for answer derivation, and there is only
little effort to include semantics and logical reasoning. This
may make it impossible for the system to find any answers:
A superficial word matching algorithm is bound to fail if
the textual sources use synonyms of the words in the
question. Therefore, the LogAnswer system models some form
of background knowledge, and combines cognitive aspects
of linguistic analysis, such as semantic nets in a logical
representation, with machine learning techniques for
determining the most appropriate answer candidate.

Contrary to other systems, LogAnswer uses an
automated theorem prover to compute the replies, namely
Hyper m, an implementation of the hypertableaux calculus
m, extended with equality handling, among other things. It
has demonstrated its strength in particular for reasoning
problems with a large number of irrelevant axioms, as they are
characteristic for the setting of question answering. The
logical reasoning is done on the basis of a logical
representation of the semantics of the entire text contained in
the Wikipedia snapshot. This is computed beforehand with a
system developed by computational linguists 02) which employs
the MultiNet graph formalism (Multilayered Extended
Semantic Networks) E).

Since methods from natural-language processing are
often confronted with flawed textual data, they strive toward
robustness and speed, but often lack the ability to perform
more complex inferences. By contrast, a theorem prover
uses a sound calculus to derive precise proofs of a higher
complexity; even minor flaws or omissions in the data,
however, lead to a failure of the entire derivation process. Thus,
additional techniques from machine learning, defeasible and
normative reasoning etc. should be applied to improve the
quality of the answers, as done in the RatioLog project.

Figure 2 Techniques used in the different modules of the LogAnswer
system.


For this, the reasoning in classical logic is extended by
various forms of non-monotonic aspects, such as defeasible
argumentation. By these extensions, the open-domain
question answering system LogAnswer is turned into a system
for rational question answering, which offers a testbed for
the evaluation of rational reasoning.


1.2 The LogAnswer System and its Modules 

When processing a question, the LogAnswer system
performs several different steps. Figure 1 presents details on
these steps. At first, information retrieval (IR), i.e. pattern
matching, is used to filter text passages suitable for the given
question from the textual representation of the Wikipedia.
For this, the text sources are segmented into sentence-sized
passages. The corresponding representation can be enriched
by the descriptors of other sentences that result from a
coreference resolution of pronouns, where the referred phrase is
added to the description, e.g. if the pronoun ‘he’ refers to the
individual ‘Ian Fleming’. Then decision tree learning ranks
the text passages and chooses a set of answer candidates
from these text passages (Ranking step in Figure 1). Here,
features like the number of matching lexemes between
passages and the question or the occurrences of proper names
in the passage are computed. Up to 200 text passages are
extracted from the knowledge base according to this ranking.
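
The candidate selection step can be sketched as follows. This is a rough illustration only: lowercased word tokens stand in for lexemes (no lemmatization), capitalized non-initial tokens stand in for proper names (a crude heuristic, especially for German, where all nouns are capitalized), and the hand-written `score` function is a hypothetical placeholder for the learned decision trees. All function names are assumptions made for this example.

```python
import re


def lexemes(text):
    """Approximate the lexemes of a text by its lowercased word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def proper_names(text):
    """Crude stand-in: capitalized tokens that do not start the passage."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens[1:] if t[0].isupper()]


def features(question, passage):
    """Compute the two features mentioned in the text for one passage."""
    return {
        "matching_lexemes": len(lexemes(question) & lexemes(passage)),
        "proper_names": len(proper_names(passage)),
    }


def select_candidates(question, passages, score, limit=200):
    """Rank passages by a scoring function and keep at most `limit` of them."""
    ranked = sorted(passages,
                    key=lambda p: score(features(question, p)),
                    reverse=True)
    return ranked[:limit]


question = "Who wrote the James Bond novels?"
passages = [
    "Emus are large birds.",
    "Ian Fleming wrote the James Bond novels.",
    "The novels were published in London.",
]


def score(f):
    # Hypothetical placeholder for the learned ranking model.
    return f["matching_lexemes"] + 0.5 * f["proper_names"]


best = select_candidates(question, passages, score, limit=1)
```

In the real system the scoring function is learned from data; the point here is only the shape of the pipeline, i.e. feature extraction per passage followed by ranking and truncation.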

In the next step (Reasoning), the Hyper theorem prover
is used to check if these text passages provide an answer to
the question. For every answer candidate, a first-order logic
representation of both the question and the answer
candidate is combined with a huge background knowledge, and the
prover tries to find a proof of the question from it. These
proofs provide the answer to the question by means of
variable assignments. The proofs for the answer candidates are
then ranked again using decision tree learning (in the
answer validation phase). For the five best answers, text
passages providing the answer are highlighted and presented to
the user. This is done by a natural language (NL) answer
generation module, which eventually yields the final answer
candidates, in our case that Ian Fleming is a British author.
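
How a proof yields an answer "by means of variable assignments" can be illustrated with a toy matcher. The sketch below is not the hypertableaux calculus and does not use background rules; it only matches a conjunctive query against ground facts. The predicate names, the `?x` variable convention, and the functions `match` and `answers` are assumptions made for this example.

```python
def match(pattern, fact, binding):
    """Extend `binding` so that `pattern` equals `fact`, or return None."""
    if len(pattern) != len(fact):
        return None
    binding = dict(binding)  # do not modify the caller's binding
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            # A variable must be bound consistently across all literals.
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return binding


def answers(query, facts, binding=None):
    """Yield every variable assignment that turns all query literals into facts."""
    binding = {} if binding is None else binding
    if not query:
        yield binding
        return
    first, rest = query[0], query[1:]
    for fact in facts:
        extended = match(first, fact, binding)
        if extended is not None:
            yield from answers(rest, facts, extended)


# Ground facts as they might be extracted from an answer candidate.
facts = [
    ("author", "ian_fleming"),
    ("british", "ian_fleming"),
    ("author", "goethe"),
]

# "Which British author ...?" as a conjunction with one open variable.
query = [("author", "?x"), ("british", "?x")]
result = list(answers(query, facts))
```

The binding of `?x` plays the role of the answer; a theorem prover additionally uses rules from the background knowledge to derive facts that are not stated explicitly.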

In the LogAnswer system, various techniques are
interlocked. See Figure 2 for an overview of the different
techniques together with the modules in which they are used.
Extraction of text passages for a certain question is performed
in the candidate selection module. In this module,
information retrieval and decision tree learning work hand in
hand to find a list of answer candidates for the current
question. For each answer candidate, the reasoning module is
invoked. This module consists of the Hyper theorem prover,
which is used to check if the answer candidate provides
an answer for the question. Since Hyper is able to handle
first-order logic with equality and knowledge bases given in
description logic, it is possible to incorporate background
knowledge given in various (formal) languages.

An interesting extension of the usual background
knowledge is the use of a knowledge base containing
normative statements formalized in deontic logic. These normative
statements enable the system to reason in a rational way.
Since deontic logic can be translated into description
logics, Hyper can be used to reason on such knowledge bases.
Reasoning in defeasible logic is another technique contained
in the reasoning module of the LogAnswer system. With the
help of defeasible logic reasoning, different proofs produced
by Hyper are compared. The proofs found by Hyper provide
answers to the given question by means of variable
assignments. Comparing the proofs for different answer candidates
is therefore used to determine the best answer. Hence
defeasible logic is contained in the answer validation module. In
addition to that, the answer validation module contains
decision tree learning to rank different proofs found by Hyper
and case-based reasoning. Details on the use of case-based
reasoning and reasoning in defeasible logic, which can both be
used in the answer validation phase (see Figure 1), can be
found in Section 2.


2 Searching for Good Answers 


As described above, the reasoning component of the
LogAnswer system delivers proofs, which represent the possible
answers to the given question. The proofs are ranked by
decision trees which take into account several attributes of the
reasoning process together with the attributes from the
previous information retrieval step.

In addition to this ranking, we experiment with
different other techniques to improve the evaluation of
answers. These are case-based reasoning (CBR) (Section 2.1),
defeasible reasoning (Section 2.2), and normative
(deontic) reasoning (Section 2.3). To perform systematic
and extensive tests with LogAnswer, we used the CLEF
database, or strictly speaking, its question answering part.
CLEF stands for cross-language evaluation forum, see
www.clef-campaign.org. It is an international campaign
providing language data in different languages, e.g. from
newspaper articles. Its workshop and competition series contains
a track on question answering. We used data from
CLEF-2007 and CLEF-2008 [12, 18].

2.1 CBR Similarity Measures and Machine Learning 

Answer validation can be enhanced by using experience
knowledge in the form of cases in a case base. The resulting
system module is designed as a learning system and based on a
dedicated CBR control structure. Contrary to common
procedures in natural-language processing, however, we do not
follow the textual approach, where experiences are available
in unstructured or semi-structured text form, but use a
structured approach along the lines of [4]. This is possible
because the knowledge source is available not only in textual
but also in a logical format. The semantics of the
natural-language text is given basically by first-order predicate logic
formulae, which are represented by MultiNet graphs tm.
Our basis is a manually obtained classification for each pair
of question (from the CLEF 2007 and 2008 data) and
answer candidate (from the LogAnswer system), stating whether
the answer candidate is a good one for the question. In order to
compare the MultiNet graphs and to define a similarity measure
on them, we have developed a new graph similarity measure
[14, 22] which improves other existing measures, e.g. EH.
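
To make the role of a graph similarity measure in case retrieval concrete, here is a minimal sketch. It does not reproduce the similarity measure developed in the project, whose definition is not given in this text; as a plainly named substitute it uses the Jaccard overlap of labeled edges, together with 1-nearest-neighbour retrieval. The graph encoding, the relation labels, and the function names are illustrative assumptions.

```python
def edge_overlap(graph_a, graph_b):
    """Jaccard similarity of two graphs given as sets of labeled edges.

    Each edge is a (source, relation, target) triple.
    """
    if not graph_a and not graph_b:
        return 1.0
    return len(graph_a & graph_b) / len(graph_a | graph_b)


def classify(query_graph, case_base):
    """Reuse the label of the most similar stored case (1-nearest neighbour)."""
    best_graph, best_label = max(
        case_base, key=lambda case: edge_overlap(query_graph, case[0]))
    return best_label


# Each case pairs a (toy) semantic graph with a manual good/bad classification.
case_base = [
    (frozenset({("fleming", "SUB", "author"),
                ("fleming", "ATTR", "british")}), True),
    (frozenset({("emu", "SUB", "bird")}), False),
]

query_graph = frozenset({("fleming", "SUB", "author")})
label = classify(query_graph, case_base)
```

A measure based on exact edge identity is much cruder than one that compares graph structure, but the retrieval loop around it has the same shape: compare the new question/answer pair with stored cases and reuse the classification of the closest one.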

We measured the classification accuracy of the CBR system by
running tests with a case base from the CLEF 2007 and 2008
data. Our overall test set had 254 very heterogeneous
questions and ca. 15000 cases. For instance, in one of the
evaluations, namely the user interaction simulation (see Figure 3),
we examined the development of the results for a growing
knowledge base. We simulated users who give reliable
feedback to new, heterogeneous questions for which the
LogAnswer system provides answer candidates. The test
setting was to guess the classification of questions and answer
candidates that the system does not have in the knowledge base.
The results show the increase of the classification accuracy
with a growing number of correct cases in the case base.
We performed a number of other evaluation experiments,
e.g. 3- and 10-fold cross validations. For more information
about the integrated CBR/machine learning evaluation and
test settings, please refer to [14, 22].

Figure 3 The x-axis is the number of cases in the case base. The
y-axis is the classification accuracy in percent, for correct and
incorrect answer candidates, as well as the overall classification
accuracy for the user interaction simulation.

We further integrated case-based reasoning into the
already existing answer selection techniques in LogAnswer
(answer validation phase, see Figure 1). For this, the
results of the CBR stage were turned into numeric features. A
ranking model determined by a supervised learning-to-rank
approach combined these CBR-based features with other
answer selection features determined by shallow linguistic
processing and logical answer validation. The final machine
learning ranker is an ensemble of ten rank-optimizing
decision trees, obtained by stratified bagging, whose
individual probability estimates are combined by averaging. When
training the machine learning ranker on a case base
optimized for perfect treatment of correct answer candidates, we
get the best overall result in our tests with a mean reciprocal
rank (MRR) of 0.74 (0.72 without CBR) and a correct
top-ranked answer chosen in 61% (58% without CBR) of the
cases. It is instructive to consider the usage of CBR features
in the machine learning ranker, by inspecting all branching
conditions in the generated trees and counting the frequency
of occurrence of each feature in such a branching condition.
Since 10 bags of 10 decision trees were generated in the 10
cross-validation runs, there is a total of 100 trees to base
results on mm. In total, 42.5% of all split conditions in the
learned trees involve one of the CBR attributes. This further
demonstrates the strong impact of CBR results on answer
re-ranking.
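
For readers unfamiliar with the reported metrics, the following sketch shows how the mean reciprocal rank, the share of correct top-ranked answers, and the averaging of per-tree probability estimates can be computed. The input format (one list of correctness flags per question, in ranked order) and the function names are assumptions made for this example; the toy data are invented and unrelated to the results reported above.

```python
def mean_reciprocal_rank(ranked_labels):
    """Average of 1/rank of the first correct answer over all questions.

    `ranked_labels` holds, per question, the correctness flags of the
    answer candidates in ranked order. Questions without any correct
    candidate contribute 0.
    """
    total = 0.0
    for labels in ranked_labels:
        rank = next((i for i, ok in enumerate(labels, start=1) if ok), None)
        if rank is not None:
            total += 1.0 / rank
    return total / len(ranked_labels)


def top1_accuracy(ranked_labels):
    """Fraction of questions whose top-ranked candidate is correct."""
    hits = sum(1 for labels in ranked_labels if labels and labels[0])
    return hits / len(ranked_labels)


def average_probability(tree_estimates):
    """Combine the probability estimates of the ensemble's trees by averaging."""
    return sum(tree_estimates) / len(tree_estimates)


# Toy data: three questions with ranked candidate lists.
rankings = [
    [True, False, False],          # correct answer at rank 1
    [False, True, False],          # correct answer at rank 2
    [False, False, False, True],   # correct answer at rank 4
]
mrr = mean_reciprocal_rank(rankings)
top1 = top1_accuracy(rankings)
```

For the toy data, the reciprocal ranks are 1, 1/2 and 1/4, so the MRR is their mean, and one of three questions has a correct top-ranked answer.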

2.2 The Specificity Criterion 

More specific answer candidates are to be preferred to less
specific ones, and we can compare them according to their
specificity as follows. To obtain what argumentation
theories call an argument, we form a pair of an answer candidate
and its derivation. The derivation can be based on
positive-conditional rules, generated from Hyper’s verifications and
capturing the Wikipedia page of the answer candidate and
the linguistic knowledge actually applied. Now we find
ourselves in the setting of defeasible reasoning and can sort the
arguments according to their specificity.

In defeasible reasoning, certain knowledge is assumed
to be defeasible. Strict knowledge, however, is specified by
contingent facts (e.g. in the emu example from Section 1,
“Tom is an emu”) and general rules holding in all possible
worlds without exception (e.g. “emus do not fly”). Strict
knowledge is always preferred to knowledge depending also
on defeasible rules (e.g. “birds normally fly”).

Already in 1985, David Poole had the idea to prefer
more specific arguments in case of conflicting results as
follows [19]: For any derivation of a given result, represented
as a tree, consider the sets of all leaves that contribute to the
applications of defeasible rules. An activation set is a set of
literals from which all literals labeling such a set of leaves are
derivable. Thereby, an activation set is sufficient to activate
the defeasible parts of a derivation in the sense of a
presupposition, without using any additional contingent facts.

One argument is now more specific than another one if
all its activation sets are activation sets of the other one. This
means that each activation set of the more specific argument
(seen as the conjunction of its literals) must be more
specific than an activation set of the other one. Note that the
meaning of the latter usage of the word “specific” is just the
traditional common-sense concept of specificity, according
to which a criterion (here: conjunction of literals) is more
specific than another one if it entails the other one.
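
The comparison by activation sets can be made concrete with a small propositional sketch. It follows the informal description above rather than any of the published formal definitions, and it makes two labeled assumptions: both "birds fly" and "emus do not fly" are treated as defeasible rules (the classic textbook reading, unlike Section 1, where the latter is strict), and an argument is represented only by the set of literals its defeasible rules need. The names STRICT, LITERALS, closure, activation_sets, and more_specific are hypothetical.

```python
from itertools import combinations

# Strict (general) rules as (premise, conclusion) pairs: emus are birds.
STRICT = [("emu", "bird")]
# Literals from which candidate activation sets are built.
LITERALS = ["emu", "bird"]


def closure(literals):
    """All literals derivable from `literals` using the strict rules only."""
    known = set(literals)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in STRICT:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def activation_sets(needed):
    """All sets of literals from which every literal in `needed` is derivable."""
    subsets = (frozenset(c)
               for r in range(len(LITERALS) + 1)
               for c in combinations(LITERALS, r))
    return {s for s in subsets if set(needed) <= closure(s)}


def more_specific(needed_a, needed_b):
    """A is at least as specific as B if all its activation sets activate B too."""
    return activation_sets(needed_a) <= activation_sets(needed_b)


# Argument for "Tom does not fly": defeasible rule emu => not flies needs 'emu'.
no_fly = {"emu"}
# Argument for "Tom flies": defeasible rule bird => flies needs 'bird'.
fly = {"bird"}
```

Here every set that activates the emu-based argument also activates the bird-based one, because "emu" strictly entails "bird", but not vice versa. The emu-based argument is therefore the more specific one and is preferred. Enumerating all subsets of literals is exponential, which hints at why tractability becomes an issue for larger rule sets.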

We discovered several weaknesses of Poole’s relation,
such as its non-transitivity: Contrary to what is obviously
intended in m and “proved” in [20], Poole’s relation is not
a quasi-ordering and cannot generate an ordering. We were
able to cure all the discovered weaknesses by defining a
quasi-ordering mm (i.e. a reflexive and transitive binary
relation), which can be seen as a correction of Poole’s
relation, maintaining and clarifying Poole’s original intuition.

The intractability of Poole’s relation, known at least
since 2003 [21], was attenuated by our quasi-ordering and
then overcome by restricting the rules to instances that were
actually used in the proofs found by Hyper, and by treating
the remaining variables (if any) as constants. With these
restrictions, the intractability no longer showed up in any
of the hundreds of examples we tested with our PROLOG
implementation.

Running this implementation through the entire
CLEF-2008 database, almost all suggested answer solutions turned
out to be incomparable w.r.t. specificity, although our
quasi-ordering can compare more arguments in practice than
Poole’s original relation. One problem here is that we have
to classify the rules of the CLEF examples as being either
general or defeasible, but there is no obvious way to classify
them. Another problem with the knowledge encoded in the
MultiNet formalism is that it first and foremost encodes only
linguistic knowledge, e.g., who is the agent of a given
sentence. Only little background knowledge is available, such
as ontological knowledge. All data from the web pages,
however, are represented by literals.

To employ more (defeasible) background knowledge we
investigated other examples, such as the emu example from
Section 1. Here, the formalization in first-order logic of the
natural-language knowledge on individuals can be achieved
with the Boxer system [17], which is dedicated to
large-scale language processing applications. These examples can
be successfully treated with the specificity criterion and also
with deontic logic (see subsequent section).

2.3 Making Use of Deontic Logic 

Normative statements like “you ought not steal” are
omnipresent in our everyday life, and humans are used to
reasoning with respect to them. Since norms can be helpful
to model rationality, they constitute an important aspect of
common-sense reasoning. This is why normative reasoning
is investigated in the RatioLog project [10]. Standard
deontic logic (SDL) [11] is a logic which is very suitable for the
formalization of knowledge about norms. SDL corresponds
to the modal logic K together with a seriality axiom. In SDL
the modal operator □ is interpreted as “it is obligatory that”
and the ◇ operator as “it is permitted that”. For example, a
norm like “you ought not steal” can be intuitively
formalized as □¬steal. From a model theoretic point of view, the
seriality axiom contained in SDL ensures that, whenever it
is obligatory that something holds, there is always an ideal
world fulfilling the obligation.
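
The model-theoretic reading of SDL can be illustrated by evaluating the two modal operators on a small Kripke model. This is a minimal sketch with hypothetical world names and function names, and it is unrelated to how Hyper processes deontic knowledge bases: a world's accessible worlds are read as its ideal alternatives, and seriality means every world has at least one of them.

```python
def is_serial(worlds, access):
    """Seriality: every world has at least one accessible (ideal) world."""
    return all(access.get(w) for w in worlds)


def obligatory(prop, world, access, valuation):
    """Box: `prop` holds in every ideal world accessible from `world`."""
    return all(prop(valuation[v]) for v in access[world])


def permitted(prop, world, access, valuation):
    """Diamond: `prop` holds in at least one accessible ideal world."""
    return any(prop(valuation[v]) for v in access[world])


worlds = ["actual", "ideal"]
# From both worlds, only the ideal world is accessible.
access = {"actual": {"ideal"}, "ideal": {"ideal"}}
# Atoms true in each world: somebody steals in the actual world only.
valuation = {"actual": {"steal"}, "ideal": set()}


def steal(atoms):
    return "steal" in atoms


def not_steal(atoms):
    return "steal" not in atoms


ought_not_steal = obligatory(not_steal, "actual", access, valuation)
may_steal = permitted(steal, "actual", access, valuation)
```

In this model the obligation not to steal holds in the actual world although it is violated there, because obligations are evaluated in the ideal worlds only. Without seriality, a world with no ideal alternative would make every statement obligatory and nothing permitted.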

In the RatioLog project, we experiment with SDL by
adding normative statements to the background
knowledge. The emu example from Section 1 contains the
normative assertion

Birds normally fly. 

which can be modeled using SDL as 
Bird → □Flies

and is added to the background knowledge. In addition to
normative statements, the background knowledge
contains assertions without any modal operators,
e.g. the statement that all emus are birds.
Formulae representing contingent facts, like the assertion

Tom is an emu. 

in the emu example, are combined with the background
knowledge containing information about norms. The
Hyper theorem prover 0 can be used to analyze the resulting
knowledge base. For example, it is possible to ask the prover
if the observed world with the emu Tom fulfills the norm that
birds usually are able to fly.

Within the RatioLog project both defeasible logic and
deontic logic are used. There are similarities between
defeasible logic and deontic logic. For example, in defeasible
logic there are rules which are considered to be not strict but
defeasible. These defeasible rules are similar to normative
statements, since norms only describe how the world ought
to be and not how it actually is. This is why we are also
investigating the connection between these two logics within
the RatioLog project.


3 Conclusions 

Deep question answering requires not only pattern
matching and indexing techniques, but also rational
reasoning. This has been investigated within the RatioLog project
as demonstrated in this article. Techniques from machine
learning with similarity measures and case-based reasoning,
defeasible reasoning with (a revision of) the specificity
criterion, and normative reasoning with deontic logic help to
select good answer candidates. If the background knowledge,
however, mainly encodes linguistic knowledge (without
general common-sense world knowledge), then the effect
on finding good answer candidates is low. Therefore, future
work will concentrate on employing even more background
world knowledge (e.g. from ontology databases), so that
rational reasoning can be exploited more effectively when
applied to this concrete knowledge.